Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
translated by 谷歌翻译
Generative modeling of human motion has broad applications in computer animation, virtual reality, and robotics. Conventional approaches develop separate models for different motion synthesis tasks, and typically use a model of a small size to avoid overfitting the scarce data available in each setting. It remains an open question whether developing a single unified model is feasible, which may 1) benefit the acquirement of novel skills by combining skills learned from multiple tasks, and 2) help in increasing the model capacity without overfitting by combining multiple data sources. Unification is challenging because 1) it involves diverse control signals as well as targets of varying granularity, and 2) motion datasets may use different skeletons and default poses. In this paper, we present MoFusion, a framework for unified motion synthesis. MoFusion employs a Transformer backbone to ease the inclusion of diverse control signals via cross attention, and pretrains the backbone as a diffusion model to support multi-granularity synthesis ranging from motion completion of a body part to whole-body motion generation. It uses a learnable adapter to accommodate the differences between the default skeletons used by the pretraining and the fine-tuning data. Empirical results show that pretraining is vital for scaling the model size without overfitting, and demonstrate MoFusion's potential in various tasks, e.g., text-to-motion, motion completion, and zero-shot mixing of multiple control signals. Project page: \url{https://ofa-sys.github.io/MoFusion/}.
translated by 谷歌翻译
The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon's 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20-50 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper.
translated by 谷歌翻译
基于图像的3D检测是自主驾驶感知系统的必不可少的组成部分。但是,它仍然受到不满意的表现,这是有限的培训数据的主要原因之一。不幸的是,在3D空间中注释对象是极度时间/资源消耗的,这使得很难任意扩展训练集。在这项工作中,我们专注于半监督的方式,并探索更便宜的替代方案(即伪标记)的可行性,以利用未标记的数据。为此,我们进行了广泛的实验,以研究伪标签是否可以在不同环境下为基线模型提供有效的监督。实验结果不仅证明了基于图像的3D检测的伪标记机制的有效性(例如,在单眼设置下,我们在没有铃铛和哨声的Kitti-3D测试集上实现了20.23 AP,用于中等水平,从6.03 AP),但还显示了几个有趣且值得注意的发现(例如,经过伪标签训练的模型的性能要比基于相同培训数据的地面真相注释训练的表现更好)。我们希望这项工作可以在半监督环境下为基于图像的3D检测社区提供见解。代码,伪标签和预培训模型将公开可用。
translated by 谷歌翻译
虚拟试验旨在在店内服装和参考人员图像的情况下产生光真实的拟合结果。现有的方法通常建立多阶段框架来分别处理衣服翘曲和身体混合,或严重依赖基于中间解析器的标签,这些标签可能嘈杂甚至不准确。为了解决上述挑战,我们通过开发一种新型的变形注意流(DAFLOF)提出了一个单阶段的尝试框架,该框架将可变形的注意方案应用于多流量估计。仅将姿势关键点作为指导,分别为参考人员和服装图像估计了自我和跨跨性别的注意力流。通过对多个流场进行采样,通过注意机制同时提取并合并了来自不同语义区域的特征级和像素级信息。它使衣服翘曲和身体合成,同时以端到端的方式导致照片真实的结果。在两个尝试数据集上进行的广泛实验表明,我们提出的方法在定性和定量上都能达到最先进的性能。此外,其他两个图像编辑任务上的其他实验说明了我们用于多视图合成和图像动画方法的多功能性。
translated by 谷歌翻译
旨在识别来自子类别的对象的细粒度视觉分类(FGVC)是一个非常具有挑战性的任务,因为固有的微妙级别差异。大多数现有工程主要通过重用骨干网络来提取检测到的歧视区域的特征来解决这个问题。然而,该策略不可避免地使管道复杂化并推动所提出的区域,其中大多数物体的大多数部分未能定位真正重要的部分。最近,视觉变压器(VIT)在传统的分类任务中表现出其强大的表现。变压器的自我关注机制将每个补丁令牌链接到分类令牌。在这项工作中,我们首先评估vit框架在细粒度识别环境中的有效性。然后,由于注意力的强度,可以直观地被认为是令牌重要性的指标,我们进一步提出了一种新颖的部分选择模块,可以应用于我们整合变压器的所有原始注意力的变压器架构进入注意地图,用于指导网络以有效,准确地选择鉴别的图像斑块并计算它们的关系。应用对比损失来扩大混淆类的特征表示之间的距离。我们将基于增强的变压器的模型Transfg命名,并通过在我们实现最先进的绩效的五个流行的细粒度基准测试中进行实验来展示它的价值。提出了更好地理解模型的定性结果。
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
For Prognostics and Health Management (PHM) of Lithium-ion (Li-ion) batteries, many models have been established to characterize their degradation process. The existing empirical or physical models can reveal important information regarding the degradation dynamics. However, there is no general and flexible methods to fuse the information represented by those models. Physics-Informed Neural Network (PINN) is an efficient tool to fuse empirical or physical dynamic models with data-driven models. To take full advantage of various information sources, we propose a model fusion scheme based on PINN. It is implemented by developing a semi-empirical semi-physical Partial Differential Equation (PDE) to model the degradation dynamics of Li-ion-batteries. When there is little prior knowledge about the dynamics, we leverage the data-driven Deep Hidden Physics Model (DeepHPM) to discover the underlying governing dynamic models. The uncovered dynamics information is then fused with that mined by the surrogate neural network in the PINN framework. Moreover, an uncertainty-based adaptive weighting method is employed to balance the multiple learning tasks when training the PINN. The proposed methods are verified on a public dataset of Li-ion Phosphate (LFP)/graphite batteries.
translated by 谷歌翻译
Hybrid unmanned aerial vehicles (UAVs) integrate the efficient forward flight of fixed-wing and vertical takeoff and landing (VTOL) capabilities of multicopter UAVs. This paper presents the modeling, control and simulation of a new type of hybrid micro-small UAVs, coined as lifting-wing quadcopters. The airframe orientation of the lifting wing needs to tilt a specific angle often within $ 45$ degrees, neither nearly $ 90$ nor approximately $ 0$ degrees. Compared with some convertiplane and tail-sitter UAVs, the lifting-wing quadcopter has a highly reliable structure, robust wind resistance, low cruise speed and reliable transition flight, making it potential to work fully-autonomous outdoor or some confined airspace indoor. In the modeling part, forces and moments generated by both lifting wing and rotors are considered. Based on the established model, a unified controller for the full flight phase is designed. The controller has the capability of uniformly treating the hovering and forward flight, and enables a continuous transition between two modes, depending on the velocity command. What is more, by taking rotor thrust and aerodynamic force under consideration simultaneously, a control allocation based on optimization is utilized to realize cooperative control for energy saving. Finally, comprehensive Hardware-In-the-Loop (HIL) simulations are performed to verify the advantages of the designed aircraft and the proposed controller.
translated by 谷歌翻译
Dynamic Graph Neural Networks (DGNNs) have been broadly applied in various real-life applications, such as link prediction and pandemic forecast, to capture both static structural information and temporal characteristics from dynamic graphs. Combining both time-dependent and -independent components, DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead under the canonical one-graph-at-a-time training pattern. To tackle the challenges, we propose PiPAD, a $\underline{\textbf{Pi}}pelined$ and $\underline{\textbf{PA}}rallel$ $\underline{\textbf{D}}GNN$ training framework for the end-to-end performance optimization on GPUs. From both the algorithm and runtime level, PiPAD holistically reconstructs the overall training paradigm from the data organization to computation manner. Capable of processing multiple graph snapshots in parallel, PiPAD eliminates the unnecessary data transmission and alleviates memory access inefficiency to improve the overall performance. Our evaluation across various datasets shows PiPAD achieves $1.22\times$-$9.57\times$ speedup over the state-of-the-art DGNN frameworks on three representative models.
translated by 谷歌翻译